\chapter{Decision Trees} \label{sec:dt}

\chapterquote{The words printed here are concepts.\\You must go through the experiences.}{Carl~Frederick}


%\chapterimageopt{serious.png}{width=5cm}{Courtesy XKCD}

\begin{learningobjectives}
\item Explain the difference between memorization and generalization.
\item Implement a decision tree classifier.
\item Take a concrete task and cast it as a learning problem, with a
  formal notion of input space, features, output space, generating
  distribution and loss function.
\end{learningobjectives}

\dependencies{None.}

\newthought{At a basic level,} machine learning is about predicting
the future based on the past.  For instance, you might wish to predict
how much a user Alice will like a movie that she hasn't seen, based on
her ratings of movies that she has seen. This prediction could be based
on many factors of the movies: their category (drama, documentary, etc.),
the language, the director and actors, the production company, etc.
In general, this means making informed
guesses about some unobserved property of some object, based on
observed properties of that object.

The first question we'll ask is: what does it mean to learn?  In order
to develop learning machines, we must know what learning actually
means, and how to determine success (or failure).  You'll see this
question answered in a very limited learning setting, which will be
progressively loosened and adapted throughout the rest of this book.
For concreteness, our focus will be on a very simple model of learning
called a \concept{decision tree}.

\section{What Does it Mean to Learn?}

Alice has just begun taking a course on machine learning.  She knows
that at the end of the course, she will be expected to have ``learned''
all about this topic.  A common way of gauging whether or not she has
learned is for her teacher, Bob, to give her a exam.  She has done
well at learning if she does well on the exam.

But what makes a reasonable exam?  If Bob spends the entire semester
talking about machine learning, and then gives Alice an exam on
History of Pottery, then Alice's performance on this exam will
\emph{not} be representative of her learning.  On the other hand, if
the exam only asks questions that Bob has answered exactly during lectures,
then this is also a bad test of Alice's learning, especially if it's
an ``open notes'' exam.  What is desired is that Alice observes
\emph{specific} examples from the course, and then has to answer new, but
related questions on the exam.  This tests whether Alice has the
ability to \concept{generalize}.  Generalization is perhaps the most
central concept in machine learning.

As a concrete example, consider a
course recommendation system for undergraduate computer science
students.  We have a collection of students and a collection of
courses.  Each student has taken, and evaluated, a subset of the
courses.  The evaluation is simply a score from $-2$ (terrible) to
$+2$ (awesome).  The job of the recommender system is to
\concept{predict} how much a particular student (say, Alice) will like
a particular course (say, Algorithms).

Given historical data from course ratings (i.e., the past) we are
trying to predict unseen ratings (i.e., the future).  Now, we could be
unfair to this system as well.  We could ask it whether Alice is
likely to enjoy the History of Pottery course.  This is unfair because
the system has no idea what History of Pottery even is, and has no
prior experience with this course.  On the other hand, we could ask it
how much Alice will like Artificial Intelligence, which she took last
year and rated as $+2$ (awesome).  We would expect the system to
predict that she would really like it, but this isn't demonstrating
that the system has learned: it's simply recalling its past
experience.  In the former case, we're expecting the system to
generalize \emph{beyond} its experience, which is unfair.  In the
latter case, we're not expecting it to generalize at all.

This general set up of predicting the future based on the past is at
the core of most machine learning.  The objects that our algorithm
will make predictions about are \concept{examples}.  In the
recommender system setting, an example would be some particular
Student/Course pair (such as Alice/Algorithms).  The desired
prediction would be the rating that Alice would give to Algorithms.

\Figure{dt_induction}{The general supervised approach to machine
  learning: a learning algorithm reads in training data and computes a
  learned function $f$.  This function can then automatically label
  future text examples.}

To make this concrete, Figure~\ref{fig:dt_induction} shows the general
framework of \concept{induction}.  We are given \concept{training
  data} on which our algorithm is expected to learn.  This training
data is the examples that Alice observes in her machine learning
course, or the historical ratings data for the recommender system.
Based on this training data, our learning algorithm induces a function
$f$ that will map a new example to a corresponding prediction.  For
example, our function might guess that $f(\text{Alice/Machine
  Learning})$ might be high because our training data said that Alice
liked Artificial Intelligence.  We want our algorithm to be able to
make lots of predictions, so we refer to the collection of examples on
which we will evaluate our algorithm as the \concept{test set}.  The
test set is a closely guarded secret: it is the final exam on which
our learning algorithm is being tested.  If our algorithm gets to peek
at it ahead of time, it's going to cheat and do better than it should.

\thinkaboutit{Why is it bad if the learning algorithm gets to peek at the test data?}

The goal of inductive machine learning is to take some training data
and use it to induce a function $f$.  This function $f$ will be
evaluated on the test data.  The machine learning algorithm has
succeeded if its performance on the test data is high.

\section{Some Canonical Learning Problems}

There are a large number of typical inductive learning problems.  The
primary difference between them is in what type of \emph{thing}
they're trying to predict.  Here are some examples:

\begin{description}
\item[Regression:] trying to predict a real value.  For instance,
  predict the value of a stock tomorrow given its past performance.
  Or predict Alice's score on the machine learning final exam based on
  her homework scores.

\item[Binary Classification:] trying to predict a simple yes/no
  response.  For instance, predict whether Alice will enjoy a course
  or not.  Or predict whether a user review of the newest Apple
  product is positive or negative about the product.

\item[Multiclass Classification:] trying to put an example into one of
  a number of classes.  For instance, predict whether a news story is
  about entertainment, sports, politics, religion, etc.  Or predict
  whether a CS course is Systems, Theory, AI or Other.

\item[Ranking:] trying to put a set of objects in order of relevance.
  For instance, predicting what order to put web pages in, in response
  to a user query.  Or predict Alice's ranked preferences over courses
  she hasn't taken.
\end{description}

\thinkaboutit{For each of these types of canonical machine learning
  problems, come up with one or two concrete examples.}

The reason that it is convenient to break machine learning problems
down by the type of object that they're trying to predict has to do
with measuring error.  Recall that our goal is to build a system that
can make ``good predictions.''  This begs the question: what does it
mean for a prediction to be ``good?''  The different types of learning
problems differ in how they define goodness.  For instance, in
regression, predicting a stock price that is off by $\$0.05$ is
perhaps much better than being off by $\$200.00$.  The same does not
hold of multiclass classification.  There, accidentally predicting
``entertainment'' instead of ``sports'' is no better or worse than
predicting ``politics.''


\section{The Decision Tree Model of Learning}

The \concept{decision tree} is a classic and natural model of
learning.  It is closely related to the fundamental computer science
notion of ``divide and conquer.''  Although decision trees can be
applied to many learning problems, we will begin with the simplest
case: binary classification.

Suppose that your goal is to predict whether some unknown user will
enjoy some unknown course.  You must simply answer ``yes'' or ``no.''
In order to make a guess, you're allowed to ask binary questions
about the user/course under consideration.  For example:

{\bf You:} Is the course under consideration in Systems?\\
{\bf Me:}  Yes\\
{\bf You:} Has this student taken any other Systems courses?\\
{\bf Me:}  Yes\\
{\bf You:} Has this student liked most previous Systems courses?\\
{\bf Me:}  No\\
{\bf You:} \emph{I predict this student will not like this course.}\\

The goal in learning is to figure out what questions to ask, in what
order to ask them, and what answer to predict once you have asked
enough questions.

\MoveNextFigure{-10cm}
\Figure{dt_example}{A decision tree for a course recommender system,
  from which the in-text ``dialog'' is drawn.}

The decision tree is so-called because we can write our set of
questions and guesses in a tree format, such as that in
Figure~\ref{fig:dt_example}.  In this figure, the questions are
written in the internal tree nodes (rectangles) and the guesses are
written in the leaves (ovals).  Each non-terminal node has two
children: the left child specifies what to do if the answer to the
question is ``no'' and the right child specifies what to do if it is
``yes.''

In order to learn, I will give you training data.  This data consists
of a set of user/course examples, paired with the correct answer for
these examples (did the given user enjoy the given course?).  From
this, you must construct your questions.  For concreteness, there is a
small data set in Table~\ref{tab:data:course} in the Appendix of this
book.  This training data consists of 20 course rating examples, with
course ratings and answers to questions that you might ask about this
pair.  We will interpret ratings of $0$, $+1$ and $+2$ as ``liked'' and
ratings of $-2$ and $-1$ as ``hated.''

In what follows, we will refer to the questions that you can ask as
\concept{features} and the responses to these questions as
\concept{feature values}.  The rating is called the \concept{label}.
An example is just a set of feature values.  And our training data is
a set of examples, paired with labels.

There are a lot of logically possible trees that you could build, even
over just this small number of features (the number is in the
millions).  It is computationally infeasible to consider all of these
to try to choose the ``best'' one.  Instead, we will build our
decision tree \emph{greedily.}  We will begin by asking:

{\bf If I could only ask one question, what question would I ask?}


\MoveNextFigure{-10cm}
\Figure{dt_histogram}{A histogram of labels for (a) the entire data
  set; (b-e) the examples in the data set for each value of the first
  four features.}

You want to find a feature that is \emph{most useful} in helping you
guess whether this student will enjoy this course.
A useful way to think about this is to look at the \concept{histogram}
of labels for each feature.  
\sidenote{A
  colleague related the story of getting his 8-year old nephew to
  guess a number between 1 and 100.  His nephew's first four questions
  were: Is it bigger than 20?  (YES) Is it even?  (YES) Does it have a
  7 in it?  (NO) Is it 80?  (NO).  It took 20 more questions to get
  it, even though 10 should have been sufficient.  At 8, the nephew
  hadn't quite figured out how to divide and conquer.
\url{http://blog.computationalcomplexity.org/2007/04/getting-8-year-old-interested-in.html}.}
This is shown for the first four features
in Figure~\ref{fig:dt_histogram}.  Each histogram shows the frequency
of ``like''/``hate'' labels for each possible value of an associated
feature.  From this figure, you can see that asking the first feature
is not useful: if the value is ``no'' then it's hard to guess the
label; similarly if the answer is ``yes.''  On the other hand, asking
the second feature \emph{is} useful: if the value is ``no,'' you can
be pretty confident that this student will hate this course; if the
answer is ``yes,'' you can be pretty confident that this student will
like this course.

More formally, you will consider each feature in turn.  You might
consider the feature ``Is this a System's course?''  This feature has
two possible value: no and yes.  Some of the training examples have an
answer of ``no'' -- let's call that the ``NO'' set.  Some of the
training examples have an answer of ``yes'' -- let's call that the
``YES'' set.  For each set (NO and YES) we will build a histogram over
the labels.  This is the second histogram in
Figure~\ref{fig:dt_histogram}.  Now, suppose you were to ask this
question on a random example and observe a value of ``no.''  Further
suppose that you must \emph{immediately} guess the label for this
example.  You will guess ``like,'' because that's the more prevalent
label in the NO set (actually, it's the \emph{only} label in the NO
set).  Alternatively, if you receive an answer of ``yes,'' you will
guess ``hate'' because that is more prevalent in the YES set.

So, for this single feature, you know what you \emph{would} guess if
you had to.  Now you can ask yourself: if I made that guess on the
\emph{training data,} how well would I have done?  In particular, how
many examples would I classify \emph{correctly?}  In the NO set (where
you guessed ``like'') you would classify all $10$ of them correctly.
In the YES set (where you guessed ``hate'') you would classify $8$
(out of $10$) of them correctly.  So overall you would classify $18$
(out of $20$) correctly.  Thus, we'll say that the \emph{score} of the
``Is this a System's course?'' question is $18/20$.

\thinkaboutit{How many training examples would you classify correctly
  for each of the other three features from
  Figure~\ref{fig:dt_histogram}?}

You will then repeat this computation for each of the available
features to us, compute the scores for each of them.  When you must
choose which feature consider first, you will want to choose the one
with the highest score.

But this only lets you choose the \emph{first} feature to ask about.
This is the feature that goes at the \emph{root} of the decision tree.
How do we choose subsequent features?  This is where the notion of
divide and conquer comes in.  You've already decided on your first
feature: ``Is this a Systems course?''  You can now \emph{partition}
the data into two parts: the NO part and the YES part.  The NO part is
the subset of the data on which value for this feature is ``no''; the
YES half is the rest.  This is the \emph{divide} step.

The \emph{conquer} step is to recurse, and run the \emph{same}
routine (choosing the feature with the highest score) on the NO set
(to get the left half of the tree) and then separately on the YES set
(to get the right half of the tree).

At some point it will become useless to query on additional features.
For instance, once you know that this is a Systems course, you
\emph{know} that everyone will hate it.  So you can immediately
predict ``hate'' without asking any additional questions.  Similarly,
at some point you might have already queried every available feature
and still not whittled down to a single answer.  In both cases, you
will need to create a leaf node and guess the most prevalent answer in
the current piece of the training data that you are looking at.


\newalgorithm%
  {dt:train}%
  {\FUN{DecisionTreeTrain}(\VAR{data}, \VAR{remaining features})}%
  {
\SETST{guess} most frequent answer in \VAR{data} \COMMENT{default answer for this data}
\IF{the labels in \VAR{data} are unambiguous}
\RETURN \STR{Leaf}(\VAR{guess}) \COMMENT{base case: no need to split further}
\ELSIF{\VAR{remaining features} is empty}
\RETURN \STR{Leaf}(\VAR{guess}) \COMMENT{base case: cannot split further}
\ELSE[we need to query more features]
\FORALL{\VAR{f} $\in$ \VAR{remaining features}}
\SETST{NO} the subset of \VAR{data} on which \VAR{f}=\CON{no}
\SETST{YES} the subset of \VAR{data} on which \VAR{f}=\CON{yes}
\STATE \VAR{score}[\VAR{f}] $\leftarrow$ \# of majority vote answers in \VAR{NO}
\STATE \quad\quad\quad\quad + \# of majority vote answers in \VAR{YES}\\
\COMMENT{the accuracy we would get if we only queried on \VAR{f}}
\ENDFOR
\SETST{f} the feature with maximal \VAR{score}(\VAR{f})
\SETST{NO} the subset of \VAR{data} on which \VAR{f}=\CON{no}
\SETST{YES} the subset of \VAR{data} on which \VAR{f}=\CON{yes}
\SETST{left} \FUN{DecisionTreeTrain}(\VAR{NO}, \VAR{remaining features} $\without$ $\{$\VAR{f}$\}$)
\SETST{right} \FUN{DecisionTreeTrain}(\VAR{YES}, \VAR{remaining features} $\without$ $\{$\VAR{f}$\}$)
\RETURN \STR{Node}(\VAR{f}, \VAR{left}, \VAR{right})
\ENDIF
}

\newalgorithm%
  {dt:predict}%
  {\FUN{DecisionTreeTest}(\VAR{tree}, \VAR{test point})}
  {
\IF{\VAR{tree} is of the form \STR{Leaf}(\VAR{guess})}
\RETURN \VAR{guess}
\ELSIF{\VAR{tree} is of the form \STR{Node}(\VAR{f}, \VAR{left}, \VAR{right})}
\IF{\VAR{f} $=$ \CON{no} in \VAR{test point}}
\RETURN \FUN{DecisionTreeTest}(\VAR{left}, \VAR{test point})
\ELSE
\RETURN \FUN{DecisionTreeTest}(\VAR{right}, \VAR{test point})
\ENDIF
\ENDIF
}

Putting this all together, we arrive at the algorithm shown in
Algorithm~\ref{alg:dt:train}.\sidenote{There are more nuanced
  algorithms for building decision trees, some of which are discussed
  in later chapters of this book.  They primarily differ in how they
  compute the \emph{score} function.}  This function,
\FUN{DecisionTreeTrain} takes two arguments: our data, and the set of
as-yet unused features.  It has two base cases: either the data is
unambiguous, or there are no remaining features.  In either case, it
returns a \STR{Leaf} node containing the most likely guess at this
point.  Otherwise, it loops over all remaining features to find the
one with the highest score.  It then partitions the data into a NO/YES
split based on the best feature.  It constructs its left and right
subtrees by recursing on itself.  In each recursive call, it uses one
of the partitions of the data, and removes the just-selected feature
from consideration.

\thinkaboutit{Is Algorithm~\ref{alg:dt:train}
  guaranteed to terminate?}

The corresponding \emph{prediction} algorithm is shown in
Algorithm~\ref{alg:dt:predict}.  This function recurses down the decision
tree, following the edges specified by the feature values in some
\VAR{test point}.  When it reaches a leaf, it returns the guess
associated with that leaf.

\section{Formalizing the Learning Problem}

As you've seen, there are several issues that we must take into
account when formalizing the notion of learning.

\begin{itemize}
\item The performance of the learning algorithm should be measured on
  unseen ``test'' data.

\item The way in which we measure performance should depend on the
  problem we are trying to solve.

\item There should be a strong relationship between the data that our
  algorithm sees at training time and the data it sees at test time.
\end{itemize}

In order to accomplish this, let's assume that someone gives us a
\concept{loss function}, $\ell(\cdot,\cdot)$, of two arguments.  The
job of $\ell$ is to tell us how ``bad'' a system's prediction is in
comparison to the truth.  In particular, if $y$ is the truth and $\hat
y$ is the system's prediction, then $\ell(y,\hat y)$ is a measure of
error.

For three of the canonical tasks discussed above, we might use the
following loss functions:

\begin{description}
\item[Regression:] \concept{squared loss} $\ell(y,\hat y) = (y - \hat
  y)^2$\\ or \concept{absolute loss} $\ell(y,\hat y) = \ab{y - \hat y}$.

\item[Binary Classification:] \concept{zero/one loss} $\ell(y,\hat y)
  = \brack{0 & \text{if } y = \hat y\\ 1 & \text{otherwise}}$
  \marginnote[-.5em]{This notation means that the loss is zero if the
    prediction is correct and is one otherwise.}

\item[Multiclass Classification:] also zero/one loss.
\end{description}

\thinkaboutit{Why might it be a bad idea to use zero/one loss to
  measure performance for a regression problem?}

Note that the loss function is something that \emph{you} must decide
on based on the goals of learning.

\begin{mathreview}{Expectated Values}
We write $\Ep_{(\vx,y) \sim \cD} [ \ell(y, f(\vx)) ]$ for the expected loss. Expectation means ``average.'' This is saying ``if you drew a bunch of $(x,y)$ pairs independently at random from $\cD$, what would your \emph{average} loss be?% (More formally, what would be the average of $\ell(y,f(\vx))$ be over these random draws?)
More formally, if $\cD$ is a discrete probability distribution, then this expectation can be expanded as:
%
\begin{equation}
  \Ep_{(\vx,y) \sim \cD} [ \ell(y, f(\vx)) ] = \sum_{(\vx,y) \in \cD} [ \cD(\vx,y) \ell(y, f(\vx)) ]
\end{equation}
%
This is \emph{exactly} the weighted average loss over the all $(\vx,y)$ pairs in $\cD$, weighted by their probability, $\cD(\vx,y)$.
If $D$ is a \emph{finite discrete distribution}, for instance defined by a finite data set $\{ (\vx_1,y_1), \dots, (\vx_N,y_N)$ that puts equal weight on each example (probability $1/N$), then we get:
%
\begin{align}
\Ep_{(\vx,y) \sim D} [ \ell(y, f(\vx)) ] 
&= \sum_{(\vx,y) \in D} [ D(\vx,y) \ell(y, f(\vx)) ]
   \becauseof{definition of expectation}\\
&= \sum_{n=1}^N [ D(\vx_n,y_n) \ell(y_n, f(\vx_n)) ]
   \becauseof{$D$ is discrete and finite}\\
&= \sum_{n=1}^N [ \frac 1 N \ell(y_n, f(\vx_n)) ] 
   \becauseof{definition of $D$}\\
&= \frac 1 N  \sum_{n=1}^N [ \ell(y_n, f(\vx_n)) ]
   \becauseof{rearranging terms}
\end{align}
%
Which is exactly the average loss on that dataset.
\\~\\
%In the case that the distribution is continuous, we need to replace the discrete sum with a continuous integral over some space $\Omega$:
%$  \Ep_{(\vx,y) \sim \cD} [ \ell(y, f(\vx)) ] = \int_{\Omega} \cD(\vx,y) \ell(y, f(\vx)) \ud \vx \ud y
%$.
%This is exactly the same but in continuous space rather than discrete space.
The most important thing to remember is that there are two equivalent ways to think about expections:
(1) The expectation of some function $g$ is the \emph{weighted average value of $g$}, where the weights are given by the underlying probability distribution.
(2) The expectation of some function $g$ is your \emph{best guess of the value of $g$} if you were to draw a single item from the underlying probability distribution.
\end{mathreview}

Now that we have defined our loss function, we need to consider where
the data (training \emph{and} test) comes from.  The model that we
will use is the \emph{probabilistic} model of learning.  Namely, there
is a probability distribution $\cD$ over input/output pairs.  This is
often called the \concept{data generating distribution}.  If we write
$\vx$ for the input (the user/course pair) and $y$ for the output (the
rating), then $\cD$ is a distribution over $(\vx,y)$ pairs.

A useful way to think about $\cD$ is that it gives \emph{high
  probability} to reasonable $(\vx,y)$ pairs, and \emph{low
  probability} to unreasonable $(\vx,y)$ pairs.  A $(\vx,y)$ pair can
be unreasonable in two ways.  First, $\vx$ might be an unusual input.
For example, a $\vx$ related to an ``Intro to Java'' course might be
highly probable; a $\vx$ related to a ``Geometric and Solid Modeling''
course might be less probable.  Second, $y$ might be an unusual rating
for the paired $\vx$.  For instance, if Alice were to take AI $100$
times (without remembering that she took it before!), she would give
the course a $+2$ almost every time.  Perhaps some semesters she might
give a slightly lower score, but it would be unlikely to see
$\vx=$Alice/AI paired with $y=-2$.

It is important to remember that we are not making \emph{any}
assumptions about what the distribution $\cD$ looks like.  (For
instance, we're not assuming it looks like a Gaussian or some other,
common distribution.)  We are also not assuming that we know what
$\cD$ is.  In fact, if you know \emph{a priori} what your data
generating distribution is, your learning problem becomes
significantly easier.  Perhaps the hardest thing about machine
learning is that we \emph{don't} know what $\cD$ is: all we get is a
random sample from it.  This random sample is our training data.

Our learning problem, then, is defined by two quantities:

\thinkaboutit{Consider the following prediction task.  Given a
  paragraph written about a course, we have to predict whether the
  paragraph is a \emph{positive} or \emph{negative} review of the
  course.  (This is the sentiment analysis problem.)  What is a
  reasonable loss function?  How would you define the data generating
  distribution?}
\begin{enumerate}
\item The loss function $\ell$, which captures our notion of what is
  \emph{important} to learn.

\item The data generating distribution $\cD$, which defines what sort
  of data we expect to see.
\end{enumerate}

We are given access to \concept{training data}, which is a random
sample of input/output pairs drawn from $\cD$.  Based on this training
data, we need to \concept{induce} a function $f$ that maps new inputs
$\hat \vx$ to corresponding prediction $\hat y$.  The key property
that $f$ should obey is that it should do well (as measured by $\ell$)
on future examples that are \emph{also} drawn from $\cD$.  Formally,
it's \concept{expected loss} $\ep$ over $\cD$ with repsect to $\ell$
should be as small as possible:
\begin{align} \label{eq:expectederror}
\ep
&\defeq
\Ep_{(\vx,y) \sim \cD} \big[ \ell(y, f(\vx)) \big]
=
\sum_{(\vx,y)} \cD(\vx,y) \ell(y, f(\vx))
\end{align}

The difficulty in minimizing our \concept{expected loss} from
Eq~\eqref{eq:expectederror} is that we \emph{don't know what $\cD$
  is!}  All we have access to is some training data sampled from it!
Suppose that we denote our training data set by $D$.  The training
data consists of $N$-many input/output pairs, $(\vx_1,y_1),
(\vx_2,y_2), \dots, (\vx_N,y_N)$.  Given a learned function $f$, we
can compute our \concept{training error}, $\hat \ep$:
\begin{align} \label{eq:trainingerror}
\hat \ep
&\defeq
\frac 1 N \sum_{n=1}^N \ell(y_n, f(\vx_n))
\end{align}

That is, our training error is simply our \emph{average error} over
the training data.  \thinkaboutit{Verify by calculation that we can
  write our training error as $\Ep_{(\vx,y) \sim D} \big[ \ell(y,
  f(\vx)) \big]$, by thinking of $D$ as a distribution that places
  probability $1/N$ to each example in $D$ and probability $0$ on
  everything else.}

Of course, we can drive $\hat \ep$ to zero by simply memorizing our
training data.  But as Alice might find in memorizing past exams, this
might not generalize well to a new exam!

This is the fundamental difficulty in machine learning: the thing we
have access to is our training error, $\hat \ep$.  But the thing we care
about minimizing is our expected error $\ep$.  In order to get the
expected error down, our learned function needs to
\concept{generalize} beyond the training data to some future data that
it might not have seen yet!

So, putting it all together, we get a formal definition of induction
machine learning: \bigemph{Given (i) a loss function $\ell$ and (ii) a
  sample $D$ from some unknown distribution $\cD$, you must compute a
  function $f$ that has low expected error $\ep$ over $\cD$ with
  respect to $\ell$.}

A very important comment is that we should \emph{never} expect a machine learning algorithm to generalize beyond the data distribution it has seen at training time.
In a famous---if posssibly apocryphal---example from the 1970s, the US Government wanted to train a classifier to distinguish between US tanks and Russian tanks.
They collected a training and test set, and managed to build a classifier with nearly 100\% accuracy on that data.
But when this classifier was run in the ``real world'', it failed miserably.
It had not, in fact, learned to distinguish between US tanks and Russian tanks, but rather just between clear photos and blurry photos.
In this case, there was a \emph{bias} in the training data (due to how the training data was collected) that caused the learning algorithm to learn something other than what we were hoping for.
We will return to this issue in Chapter~\ref{sec:bias}; for now, simply remember that the distribution $D$ for training data \emph{must match} the distribution $D$ for the test data.



\section{Chapter Summary and Outlook}

At this point, you should be able to use decision trees to do machine
learning.  Someone will give you data.  You'll split it into
training, development and test portions.  Using the training and
development data, you'll find a good value for maximum depth that
trades off between underfitting and overfitting.  You'll then run the
resulting decision tree model on the test data to get an estimate of
how well you are likely to do in the future.

You might think: why should I read the rest of this book?  Aside from
the fact that machine learning is just an awesome fun field to learn
about, there's a lot left to cover.  In the next two chapters, you'll
learn about two models that have very different inductive biases than
decision trees.  You'll also get to see a very useful way of thinking
about learning: the geometric view of data.  This will guide much of
what follows.  After that, you'll learn how to solve problems more
complicated that simple binary classification.  (Machine learning
people like binary classification a lot because it's one of the
simplest non-trivial problems that we can work on.)  After that,
things will diverge: you'll learn about ways to think about learning
as a formal optimization problem, ways to speed up learning, ways to
learn without labeled data (or with very little labeled data) and all
sorts of other fun topics.

But throughout, we will focus on the view of machine learning that
you've seen here.  You select a model (and its associated inductive
biases).  You use data to find parameters of that model that work well
on the training data.  You use development data to avoid underfitting
and overfitting.  And you use test data (which you'll never look at or
touch, right?) to estimate future model performance.  Then you conquer
the world.

%\section{Decision Trees with Real-Valued Features}


\section{Further Reading}

In our discussion of decision trees, we used \emph{misclassification rate} for selecting features.
While simple and intuitive, misclassification rate has problems.
There has been a significant amount of work that considers more advanced splitting criteria; the most popular is ID3, based on the mutual information quantity from information theory.
We have also only considered a very simple mechanism for controlling inductive bias: limiting the depth of the decision tree.
Again, there are more advanced ``tree pruning'' techniques that typically operate by growing deep trees and then pruning back some of the branches.
These approaches have the advantage that different branches can have different depths, accounting for the fact that the amount of data that gets passed down each branch might differ dramatically\mycite{quinlan}.


\begin{comment}
Predicting the future
 - Not memorizing the past (simple prediction problem)
 - Generalizing from known to unknown
 - Training versus test data

Models, algorithms, theory and experiments

Evaluation

Optimizing 0/1 loss

Underfitting/overfitting by depth

Pruning

From two classes to M classes
\end{comment}


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "courseml"
%%% End: 
